Not interpreting server NotFound error as server down #4854

izaaklauer · 2023-07-21T22:02:20Z

How does the runner get stuck?

Here's the failure scenario:

1: The server and runner are both connected and fine.

2: A new job arrives, and the server decides to assign it to our runner.

3: Our first problem - the server has an issue assigning the job to a runner, and it's a NotFound-type error. We've seen it happen here, while getting config-sourcers.

waypoint/pkg/server/singleprocess/service_runner.go

Line 661 in 918df22

    
           return hcerr.Externalize(log, err, "failed to get the configuration for a dynamic source plugin to send with job assignment")

. There's likely another server bug at play here making it semi-regularily return NotFound errors, but fundamentally, returning a NotFound error here is something the server is allowed to do.

4: The runner receives this error, and because it's a NotFound error, it calls goto RESTART_JOB_STREAM (here), which we should only do if we've actually disconnected from the server. This is the bug. Relevant chunk:

waypoint/internal/runner/accept.go

Lines 278 to 287 in 918df22

    
           resp, err := client.Recv() 
        
           if err != nil { 
        
           	if atomic.LoadInt32(&canceled) > 0 || 
        
           		status.Code(err) == codes.Unavailable || 
        
           		status.Code(err) == codes.NotFound { 
        
           		goto RESTART_JOB_STREAM 
        
           	} 
        
           	return err 
        
           }

5: The runner follows the goto, and then blocks here forever:

waypoint/internal/runner/accept.go

Line 224 in 918df22

if r.waitStateGreater(&r.stateConfig, stateGen) {

Why is the runner blocking there, you ask? Here's my best understanding:

If this were a true disconnect, there is a second goroutine that would also be attempting to reconnect - the recvConfig loop (here). It has sophisticated reconnect logic, and if it were to also have seen the disconnect, it would reconnect, and then increment the runner-wide state version here, which would unblock the accept goroutine. @martisah and I verified this through observation.

That isn't happening here, because the recvConfig goroutine hasn't seen an error case, so it never has a reason to modify the runner state.

How does this fix it?

The primary cause is that the runner's accept goroutine is interpreting any NotFound error from the server as a disconnect event, which it expects all other goroutines will see. That's a bad assumption.

This PR modifies the accept goroutine to exit if it sees a NotFound error, rather than attempt a server reconnect. This raised a new bug: the top-level AcceptMany function was interpreting the server sending NotFound on the runner job stream as a deregistration event:

waypoint/internal/runner/accept.go

Lines 76 to 80 in 918df22

    
           case codes.NotFound: 
        
           	// This means the runner was deregistered and we must exit. 
        
           	// This won't be fixed unless the runner is closed and restarted. 
        
           	r.logger.Error("runner unexpectedly deregistered, exiting") 
        
           	return

That's also a bad assumption. The server in fact sends FailedPrecondition (here).

This fixes that second bug by again interpreting NotFound as a generic error. The AcceptMany NotFound check was introduced to prevent a runner DDOS event (here), and this continues to protect against that by moving the sleep before reconnect to the default case.

tl;dr: with this change, if the server responds to the runner with a NotFound error, the Accept loop will log the error, exit, and then be restarted successfully by AcceptMany, ready to accept another job.

How can I reproduce this bug?

Install a runner in ECS
Create a waypoint project with ODR and configsourcers, pointed to the HCP waypoint server
Set the static runner's concurrency flag to 1
Somehow get the server to respond with NotFound on the runner job stream. It's not yet clear the exact steps to reproduce this, but we got there by running some number of waypoint up's, destroying the project, recreating the project, and repeating.
At this point, the static runner should log server down before accepting a job, will reconnect, and then refuse to accept additional jobs. You will notice continued logs from waypoint.runner.agent.runner.config.watcher, but none from the accept loop.

Future considerations

We intend to push some of the debugging logging we introduced to find this issue
We should investigate why the server is responding NotFound on the job stream
We should verify further that runner deregistration still works. In my testing, I observed that with this change, the runner handles one additional job before exiting, but that may be preexisting behavior.

izaaklauer · 2023-07-21T22:03:27Z

internal/runner/operation_project_destroy.go

+
+	projConfig "github.com/hashicorp/waypoint/internal/config"
+	"github.com/hashicorp/waypoint/internal/core"
+	pb "github.com/hashicorp/waypoint/pkg/server/gen"


Unrelated auto-formatting fix

Co-authored-by: Martina Santangelo <[email protected]>

github-actions bot added the core label Jul 21, 2023

izaaklauer requested a review from martisah July 21, 2023 22:02

izaaklauer commented Jul 21, 2023

View reviewed changes

izaaklauer force-pushed the runner-notfound-err-handling branch from 0a9baf8 to 915362f Compare July 21, 2023 22:06

vercel bot deployed to Preview – waypoint-ui July 21, 2023 22:07 View deployment

Not interpreting server NotFound error as server down

bb671e8

Co-authored-by: Martina Santangelo <[email protected]>

izaaklauer force-pushed the runner-notfound-err-handling branch from 915362f to bb671e8 Compare July 21, 2023 22:14

vercel bot deployed to Preview – waypoint-ui July 21, 2023 22:15 View deployment

martisah approved these changes Jul 25, 2023

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Not interpreting server NotFound error as server down #4854

Not interpreting server NotFound error as server down #4854

izaaklauer commented Jul 21, 2023 •

edited

Loading

izaaklauer Jul 21, 2023

	resp, err := client.Recv()
	if err != nil {
	if atomic.LoadInt32(&canceled) > 0 \|\|
	status.Code(err) == codes.Unavailable \|\|
	status.Code(err) == codes.NotFound {
	goto RESTART_JOB_STREAM
	}

	return err
	}

	case codes.NotFound:
	// This means the runner was deregistered and we must exit.
	// This won't be fixed unless the runner is closed and restarted.
	r.logger.Error("runner unexpectedly deregistered, exiting")
	return

Not interpreting server NotFound error as server down #4854

Are you sure you want to change the base?

Not interpreting server NotFound error as server down #4854

Conversation

izaaklauer commented Jul 21, 2023 • edited Loading

How does the runner get stuck?

How does this fix it?

How can I reproduce this bug?

Future considerations

izaaklauer Jul 21, 2023

Choose a reason for hiding this comment

izaaklauer commented Jul 21, 2023 •

edited

Loading